Abstract. There is growing interest in using Kalman-filter models in brain
modelling. It is therefore of considerable importance to make Kalman filters
amenable to reinforcement learning. In the usual formulation, the optimal
control is computed off-line by solving a backward recursion. In this technical
note we show that a slight modification of the linear-quadratic-Gaussian
Kalman-filter model allows the on-line estimation of optimal control and builds
a bridge to reinforcement learning. Moreover, the learning rule for value
estimation assumes a Hebbian form weighted by the error of the value estimation.



1. Motivation 

Kalman filters and their various extensions are well studied and widely applied
tools in both state estimation and control. Recently, there has been increasing
interest in Kalman filters or Kalman-filter-like structures as models for
neurobiological substrates. It has been suggested that Kalman filtering (i) may
occur at sensory processing jB]|7j , (ii) may be the underlying computation of
the hippocampus, and (iii) may be the underlying principle in control
architectures [5]. Detailed architectural similarities between the Kalman filter
and the entorhinal-hippocampal loop, as well as between Kalman filters and the
neocortical hierarchy, have been described recently PIE]. The interplay between
the dynamics of Kalman-filter-like architectures and the learning of parameters
of neuronal networks has promising aspects for explaining known and puzzling
phenomena, such as priming, repetition suppression and categorization ^JQ.

As is well known, the Kalman filter provides an on-line estimate of the state
of the system. Optimal control, on the other hand, cannot be computed on-line,
because it is typically given by a backward recursion (the Riccati equations). For
on-line parameter estimations without control aspects, see |S]- 

The aim of this paper is to derive an on-line control method for the Kalman
filter that achieves optimal performance asymptotically. A slight modification
of the linear-quadratic-Gaussian (LQG) Kalman-filter model is introduced so that
the LQG model can be treated as a reinforcement learning (RL) problem.



2. The Kalman filter and the LQG model

Consider a linear dynamical system with state $x_t \in \mathbb{R}^n$, control
$u_t \in \mathbb{R}^m$, observation $y_t \in \mathbb{R}^k$, and noises
$w_t \in \mathbb{R}^n$ and $e_t \in \mathbb{R}^k$ (which are assumed to be
Gaussian and white, with covariance matrices $\Omega_w$ and $\Omega_e$,
respectively), in discrete time $t$:

(1) $x_{t+1} = F x_t + G u_t + w_t,$

(2) $y_t = H x_t + e_t.$

The initial state has mean $x_1$ and covariance $\Sigma_1$. Executing the
control step $u_t$ in $x_t$ costs

(3) $c(x_t, u_t) := x_t^T Q x_t + u_t^T R u_t,$

and after the $N$th step the controller halts and receives a final cost of
$x_N^T Q_N x_N$.
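
As a rough illustration of the model (1)-(3), the following Python sketch simulates one run and accumulates the cost. The function name `simulate_lqg`, the `policy(t, y)` callback, passing a concrete initial state instead of sampling it, and the indexing convention (states $x_1, \ldots, x_N$, controls at $t = 1, \ldots, N-1$, final cost at $x_N$) are assumptions made here for illustration only.

```python
import numpy as np

def simulate_lqg(F, G, H, Q, R, Q_N, Omega_w, Omega_e, x1, policy, N, rng):
    """Simulate equations (1)-(2) and return the accumulated cost (3).

    `policy(t, y)` is a hypothetical callback that maps the time index and
    the current observation to a control vector.
    """
    n, k = F.shape[0], H.shape[0]
    x = np.asarray(x1, dtype=float)          # concrete initial state (assumption)
    total_cost = 0.0
    for t in range(1, N):
        e = rng.multivariate_normal(np.zeros(k), Omega_e)   # observation noise e_t
        y = H @ x + e                                        # equation (2)
        u = policy(t, y)
        total_cost += x @ Q @ x + u @ R @ u                  # stage cost, equation (3)
        w = rng.multivariate_normal(np.zeros(n), Omega_w)    # process noise w_t
        x = F @ x + G @ u + w                                # equation (1)
    total_cost += x @ Q_N @ x                                # final cost at x_N
    return total_cost
```

Any controller can be plugged in through `policy`; a controller that always returns the zero vector gives the cost of the uncontrolled system.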
This problem has the well-known solution (with $\hat{x}_t$ denoting the
estimate of the state $x_t$)

(4) $\hat{x}_{t+1} = F \hat{x}_t + G u_t + K_t (y_t - H \hat{x}_t)$

(5) $K_t = F \Sigma_t H^T (H \Sigma_t H^T + \Omega_e)^{-1}$

(6) $\Sigma_{t+1} = \Omega_w + F \Sigma_t F^T - K_t H \Sigma_t F^T$ (state estimation)

and

(7) $u_t = -L_t \hat{x}_t$

(8) $L_t = (G^T S_{t+1} G + R)^{-1} G^T S_{t+1} F$

(9) $S_t = Q_t + F^T S_{t+1} F - F^T S_{t+1} G L_t$. (optimal control)

Unfortunately, the optimal control equations are not on-line, because they can
be solved only by stepping backward from the final, $N$th step.
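
The contrast between the two recursions can be sketched in Python: the filter equations (4)-(6) run forward one observation at a time, while the gains of (8)-(9) have to be computed backward from the horizon. The function names, the boundary condition $S_N = Q_N$ and the use of a constant stage cost matrix (`Q` for every $t < N$) are assumptions of this sketch.

```python
import numpy as np

def kalman_step(x_hat, Sigma, u, y, F, G, H, Omega_w, Omega_e):
    """One forward step of equations (4)-(6); x_hat is the state estimate."""
    K = F @ Sigma @ H.T @ np.linalg.inv(H @ Sigma @ H.T + Omega_e)   # (5)
    x_hat_next = F @ x_hat + G @ u + K @ (y - H @ x_hat)              # (4)
    Sigma_next = Omega_w + F @ Sigma @ F.T - K @ H @ Sigma @ F.T      # (6)
    return x_hat_next, Sigma_next

def lqr_backward(F, G, Q, R, Q_N, N):
    """Backward recursion (8)-(9); gains[t-1] holds L_t for t = 1..N-1."""
    S = Q_N                                    # assumed boundary condition S_N = Q_N
    gains = [None] * (N - 1)
    for t in range(N - 1, 0, -1):              # must start from the final step
        L = np.linalg.solve(G.T @ S @ G + R, G.T @ S @ F)   # (8), uses S_{t+1}
        S = Q + F.T @ S @ F - F.T @ S @ G @ L               # (9), yields S_t
        gains[t - 1] = L
    return gains, S
```

Note that `lqr_backward` needs the horizon `N` before the first control can be issued, which is exactly the off-line character discussed above.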



3. Kalman Filtering in the Reinforcement Learning Framework 

First of all, we slightly modify the problem: the run time of the controller
will not be a fixed number $N$. Instead, after each time step, the process will
be stopped with some fixed probability $p$ (and then the controller incurs the
final cost $c_f(x_f) := x_f^T Q_f x_f$).

3.1. The cost-to-go function. Let $V_t^*(x)$ be the optimal cost-to-go function
at time step $t$, i.e.

(10) $V_t^*(x) := \inf_{u_t, u_{t+1}, \ldots} E[c(x_t, u_t) + c(x_{t+1}, u_{t+1}) + \ldots + c_f(x_f) \mid x_t = x].$

Clearly, for any $x$,

(11) $V_t^*(x) = p \cdot c_f(x) + (1 - p) \cdot \inf_u \left( c(x, u) + E_w[V_{t+1}^*(Fx + Gu + w)] \right).$

It can be easily shown that the optimal cost-to-go function is time-independent;
furthermore, it is a quadratic function of $x$, that is, it is of the form

(12) $V^*(x) = x^T \Pi^* x.$
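
To make the quadratic form concrete, one can substitute $V(x) = x^T \Pi x$ into the right-hand side of (11) and carry out the minimization over $u$ in closed form, which gives a matrix map whose fixed point is the parameter matrix of (12). The Python sketch below iterates that map. It is an off-line check that assumes the model matrices are known, not the on-line method of this note; the function names are hypothetical, and the sketch tracks only the quadratic part (the process noise contributes a term that does not depend on $x$, which is left out here).

```python
import numpy as np

def bellman_quadratic_map(Pi, F, G, Q, R, Q_f, p):
    """Quadratic part of the right-hand side of (11) for V(x) = x^T Pi x."""
    M = G.T @ Pi @ F
    # min over u of u^T R u + (Fx + Gu)^T Pi (Fx + Gu), written as x^T (.) x
    inner = Q + F.T @ Pi @ F - M.T @ np.linalg.solve(R + G.T @ Pi @ G, M)
    return p * Q_f + (1.0 - p) * inner

def solve_pi_star(F, G, Q, R, Q_f, p, iters=500):
    """Fixed-point iteration started from Pi = 0 (assumed symmetric Q, Q_f)."""
    Pi = np.zeros_like(Q, dtype=float)
    for _ in range(iters):
        Pi = bellman_quadratic_map(Pi, F, G, Q, R, Q_f, p)
    return Pi
```

In the scalar case $F = G = Q = R = Q_f = 1$ the fixed point satisfies $\Pi = p + (1 - p)(1 + \Pi / (1 + \Pi))$, which can be used to sanity-check the iteration.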

Our task is to estimate $V^*$ (in fact, the parameter matrix $\Pi^*$) on-line.
This will be done by value iteration.

3.2. Value iteration, greedy action selection and the temporal differencing
error. Value iteration starts with an arbitrary initial cost-to-go function
$V_0(x) = x^T \Pi_0 x$. After this, control actions are selected according to the
current value function estimate, the value function is updated according to the
experience, and these two steps are iterated.

The $t$th estimate of $V^*$ is $V_t(x) = x^T \Pi_t x$. The greedy control action
according to this is given by

(13) $u_t = \arg\min_u \left( c(x_t, u) + E[V_t(F x_t + G u + w)] \right)$

(14) $= \arg\min_u \left( u^T R u + (F x_t + G u)^T \Pi_t (F x_t + G u) \right)$

(15) $= -(R + G^T \Pi_t G)^{-1} (G^T \Pi_t F) x_t.$
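
The closed form (15) can be checked numerically against the objective of (14). The following Python sketch computes the greedy action and the one-step objective; the function names are hypothetical, and the state is assumed to be directly available.

```python
import numpy as np

def greedy_control(x, F, G, R, Pi):
    """Equation (15): minimizer of u^T R u + (F x + G u)^T Pi (F x + G u)."""
    gain = np.linalg.solve(R + G.T @ Pi @ G, G.T @ Pi @ F)
    return -gain @ x

def one_step_objective(u, x, F, G, R, Pi):
    """The quantity inside the arg min of equation (14)."""
    z = F @ x + G @ u
    return u @ R @ u + z @ Pi @ z
```

For a positive definite `R` and a positive semidefinite `Pi`, perturbing the returned action in any direction should not decrease `one_step_objective`.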



For the sake of simplicity, the cost-to-go function will be updated by using the 
1-step temporal differencing (TD) method. Naturally, it can be substituted with 
more sophisticated methods like multi-step TD or eligibility traces. The TD error 
is 

(16) $\delta_t = \begin{cases} c_f(x_t) - V_t(x_t), & \text{if the controller was stopped at the $t$th time step,} \\ (c(x_t, u_t) + V_t(x_{t+1})) - V_t(x_t), & \text{otherwise,} \end{cases}$

and the update rule for the parameter matrix $\Pi_t$ is

(17) $\Pi_{t+1} = \Pi_t + \alpha_t \delta_t \nabla_{\Pi_t} V_t(x_t)$

(18) $= \Pi_t + \alpha_t \cdot \delta_t \cdot x_t x_t^T,$

where $\alpha_t$ is the learning rate. Note that a Hebbian learning rule weighted
by the value-estimation error has emerged.
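
Putting the pieces together, a minimal Python sketch of the on-line loop might look as follows. Several choices here are assumptions rather than part of the method as stated: the state is treated as directly available (in practice the filter estimate would take its place), the learning rate is constant, a new run is restarted from a standard normal initial state after each stop, and the TD error is taken as target minus current estimate in both cases, so that in the stopped case it is $c_f(x_t) - V_t(x_t)$. The function names are hypothetical.

```python
import numpy as np

def td_update(Pi, x, alpha, target):
    """Equations (17)-(18): Pi <- Pi + alpha * delta * x x^T."""
    delta = target - x @ Pi @ x               # TD error: target minus estimate
    return Pi + alpha * delta * np.outer(x, x), delta

def online_lqg_td(F, G, Q, R, Q_f, Omega_w, p, alpha, steps, rng, Pi0=None):
    """Greedy control (15) interleaved with 1-step TD updates of Pi."""
    n = F.shape[0]
    Pi = np.eye(n) if Pi0 is None else Pi0.copy()
    x = rng.standard_normal(n)                # assumed initial-state distribution
    for _ in range(steps):
        if rng.random() < p:                  # controller stopped at this step
            Pi, _ = td_update(Pi, x, alpha, x @ Q_f @ x)
            x = rng.standard_normal(n)        # restart a new run (assumption)
            continue
        gain = np.linalg.solve(R + G.T @ Pi @ G, G.T @ Pi @ F)
        u = -gain @ x                         # greedy action, equation (15)
        w = rng.multivariate_normal(np.zeros(n), Omega_w)
        x_next = F @ x + G @ u + w            # equation (1)
        target = x @ Q @ x + u @ R @ u + x_next @ Pi @ x_next
        Pi, _ = td_update(Pi, x, alpha, target)
        x = x_next
    return Pi
```

The outer product `np.outer(x, x)` is the Hebbian term and `delta` is the scalar that weights it; since the outer product is symmetric, a symmetric initial `Pi` stays symmetric throughout.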

4. Concluding remarks 

The Kalman-filter control problem was slightly modified to fit the RL framework 
and an on-line control rule was achieved. The well-founded theory of reinforcement 
learning ensures asymptotic optimality for the algorithm. The described method is 
highly extensible. There are straightforward generalizations to other cases, e.g., to 
extended Kalman filters, dynamics with unknown parameters, non-quadratic cost 
functions, or more advanced RL algorithms, e.g. eligibility traces. For quadratic 
loss functions, we have found that learning is Hebbian and it is weighted by the 
error of value-estimation. 